Shutdown ClusterTopologyRefreshTask properly #2985

thachlp · 2024-09-12T06:18:41Z

Issue: #2904
Make sure that:

You have read the contribution guidelines.
You have created a feature request first to discuss your contribution intent. Please reference the feature request ticket number in the pull request.
You applied code formatting rules using the mvn formatter:format target. Don’t submit any formatting related changes.
You submit test cases (unit or integration tests) that back your changes.

ggivo · 2024-09-12T10:05:09Z

src/test/java/io/lettuce/core/cluster/RedisClusterClientIntegrationTests.java

+        Delay.delay(Duration.ofMillis(1500));
+        assertThat(clusterClient.isTopologyRefreshInProgress()).isTrue();


Hi, I'm not sure about this.
Topology refresh in the test environment is quick, and there is no guarantee that we are in the right state for the test.
Most likely the refresh will either not have started yet or will already have completed when the assert is performed, making the test flaky. Also, we are adding a delay to the tests as a whole.

I ran the suggested test and it failed 10/10 times on the assert.

Do you have any idea how to reproduce the issue?

Do you mean how to reproduce the failing test locally or the actual issue?
For the actual issue "java.util.concurrent.RejectedExecutionException", I tried to reproduce it but could not.
I will spend some more time on it next week and see if I can come up with an approach.

Hi @thachlp ,
I took a brief look and I think this issue can be tested more easily with a unit test. There is already a similar one that tests the client shutdown order. You can take a look at it here

I suggest using a unit test to confirm that the pending ClusterTopologyRefreshTask is canceled/completed before shutting down the Executor group. We can inject a mock of the ClusterTopologyRefreshTask and complete it after client shutdown is initiated.

Hope it helps
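
The ordering such a unit test would check can be sketched with plain JDK types. This is only an illustration under stated assumptions: `FakeClient`, `pendingRefresh`, and `shutdownAsync()` are hypothetical stand-ins, not Lettuce classes, and a real test would inject a mock of ClusterTopologyRefreshTask as suggested above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for the client under test; not Lettuce code.
final class ShutdownOrderSketch {

    static final class FakeClient {
        final ExecutorService executor = Executors.newSingleThreadExecutor();
        // Plays the role of the mocked, still-pending refresh task.
        final CompletableFuture<Void> pendingRefresh;

        FakeClient(CompletableFuture<Void> pendingRefresh) {
            this.pendingRefresh = pendingRefresh;
        }

        // Shut the executor down only after the pending refresh has finished,
        // whether it finished normally or exceptionally.
        CompletableFuture<Void> shutdownAsync() {
            return pendingRefresh.handle((v, t) -> null).thenRun(executor::shutdown);
        }
    }

    // Returns true if the executor stayed alive until the refresh completed.
    static boolean executorOutlivesPendingRefresh() {
        CompletableFuture<Void> refresh = new CompletableFuture<>();
        FakeClient client = new FakeClient(refresh);

        CompletableFuture<Void> shutdown = client.shutdownAsync();
        boolean aliveWhilePending = !client.executor.isShutdown() && !shutdown.isDone();

        refresh.complete(null); // complete the "mock" after shutdown was initiated
        shutdown.join();

        return aliveWhilePending && client.executor.isShutdown();
    }

    public static void main(String[] args) {
        System.out.println(executorOutlivesPendingRefresh());
    }
}
```

Because the refresh is completed by the test itself, the check does not depend on timing and needs no delay.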

ggivo · 2024-09-12T10:13:12Z

src/main/java/io/lettuce/core/cluster/RedisClusterClient.java

+     */
+    public void cancelTopologyRefresh() {
+        topologyRefreshScheduler.cancelTopologyRefreshTask();
+    }


There is already a suspendTopologyRefresh method, invoked on Client.shutdown, which disables periodic topology refresh.
To my understanding, what we are missing is cancellation of an already submitted TopologyRefreshTask (if any), and also logic that prevents submission of a new one after Client.shutdown is initiated. A brief look at the code shows that the other code path triggering TopologyRefreshTask is when certain cluster events happen.
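
The two missing pieces can be sketched roughly as follows. This is not the Lettuce implementation: `GuardedRefreshSketch`, `requestRefresh`, and the `shuttingDown` flag are hypothetical names, and a JDK ExecutorService stands in for the executor group.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical scheduler wrapper; names are illustrative only.
final class GuardedRefreshSketch {

    private final ExecutorService executor;
    private boolean shuttingDown; // guarded by "this"
    private Future<?> submitted;  // guarded by "this"

    GuardedRefreshSketch(ExecutorService executor) {
        this.executor = executor;
    }

    // Entry point for both the periodic schedule and cluster events.
    // The lock makes the shutdown check and the submit one atomic step.
    synchronized boolean requestRefresh(Runnable refresh) {
        if (shuttingDown) {
            return false; // no new refresh once shutdown has started
        }
        submitted = executor.submit(refresh);
        return true;
    }

    synchronized void shutdown() {
        shuttingDown = true;
        if (submitted != null) {
            // Prevents a queued task from starting; a task that is already
            // running is not interrupted.
            submitted.cancel(false);
        }
        executor.shutdown();
    }

    static boolean demo() {
        GuardedRefreshSketch scheduler = new GuardedRefreshSketch(Executors.newSingleThreadExecutor());
        boolean acceptedBefore = scheduler.requestRefresh(() -> { });
        scheduler.shutdown();
        boolean acceptedAfter = scheduler.requestRefresh(() -> { });
        return acceptedBefore && !acceptedAfter;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Routing every trigger through one guarded method means a cluster event that arrives during shutdown is dropped instead of reaching a terminated executor.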

Hi, thanks for the review 🙇

I think what we are missing is cancelling the running task, so the cancelTopologyRefresh method is there to do that.

@tishun @mp911de Do you have any suggestions?

Hey @thachlp, I will try to spend some time on this issue on Friday.

tishun

Hey @thachlp,

Thanks for giving this fix a go. I think, however, you may be on the wrong path.

Judging from the stack trace in #2904 the ClusterTopologyRefreshScheduler attempts to refresh the topology AFTER the connections have been closed and the client is shutting down.

The suspendTopologyRefresh() method is supposed to suspend any topology refresh tasks, but it seems there is some case (a race condition, perhaps?) where a task is still executed during shutdown.

tishun · 2024-10-14T14:29:48Z

src/main/java/io/lettuce/core/cluster/RedisClusterClient.java

+     */
+    public void cancelTopologyRefresh() {
+        topologyRefreshScheduler.cancelTopologyRefreshTask();
+    }


Hey @thachlp ,

I think that, based on the comment in #2904, the request is to cancel any topology tasks automatically when we initiate shutdown:

When I close the RedisClusterClient by invoking RedisClusterClient.shutdown method, there is a chance that the ClusterTopologyRefreshTask is not stopped.
The issue was supposed to be resolved with #656 but it seems it was not completely working.

In #656, the idea of the fix was to drain all the existing cluster connections and cancel them upon shutdown. The user was never asked to call another method (and IMHO should not be asked).

thachlp · 2024-11-04T09:34:37Z

Hey @thachlp,

Thanks for giving this fix a go. I think, however, you may be on the wrong path.

Judging from the stack trace in #2904 the ClusterTopologyRefreshScheduler attempts to refresh the topology AFTER the connections have been closed and the client is shutting down.

The suspendTopologyRefresh() method is supposed to suspend any topology refresh tasks, but it seems there is some case (a race condition, perhaps?) where a task is still executed during shutdown.

From the Javadoc of suspendTopologyRefresh:

    /**
     * Suspend periodic topology refresh if it was activated previously. Suspending cancels the periodic schedule without
     * interrupting any running topology refresh. Suspension is in place until obtaining a new {@link #connect connection}.
     *
     * @since 6.3
     */
    public void suspendTopologyRefresh() {
        topologyRefreshScheduler.suspendTopologyRefresh();
    }

From my point of view, when we shut down RedisClusterClient, we should STOP running tasks and CANCEL scheduled tasks; that is why I wrote code to STOP running tasks.

Thanks @tishun for the explanation. Do you have any suggestions for the fix?
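
The behaviour described in the Javadoc quoted above (cancelling the periodic schedule without interrupting a running refresh) can be illustrated with a JDK ScheduledFuture. This is a sketch under the assumption that the periodic refresh behaves like a JDK scheduled task; `SuspendSketch` and its method are hypothetical names, not Lettuce code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class SuspendSketch {

    // Returns true if a run that was in progress at "suspend" time still finished.
    static boolean runningTaskSurvivesSuspend() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        AtomicBoolean finished = new AtomicBoolean();

        ScheduledFuture<?> periodic = scheduler.scheduleAtFixedRate(() -> {
            started.countDown();
            try {
                release.await();     // simulate a refresh that is still in flight
                finished.set(true);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 0, 10, TimeUnit.MILLISECONDS);

        try {
            started.await();
            periodic.cancel(false);  // "suspend": no further runs, no interruption
            release.countDown();     // let the in-flight run finish
            scheduler.shutdown();
            return scheduler.awaitTermination(5, TimeUnit.SECONDS)
                    && periodic.isCancelled()
                    && finished.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(runningTaskSurvivesSuspend());
    }
}
```

The in-flight run completes even though the schedule is already cancelled, which is the window in which a refresh can still touch resources that shutdown is about to release.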

tishun · 2024-11-05T12:42:06Z

I will try to get back to you at the end of the week.

mp911de · 2024-11-06T10:44:33Z

This PR introduces a check for a very specific scenario. The change doesn't necessarily lead to a proper cancellation, as the task itself is composed of a series of refresh steps that are coupled through CompletableFutures. Specifically, RedisClusterClient.refreshPartitionsAsync(…) is being called, which has no notion of being interrupted.

I think conceptually the easiest approach is to synchronize (and wait) until ClusterTopologyRefreshTask has finished before shutting down ClientResources. ClusterTopologyRefreshTask would require a CompletableFuture<Void> that is being completed upon completion of Supplier<CompletionStage<?>>.
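
A minimal sketch of that idea, using only JDK types. The `RefreshTask` shape and its `done()` accessor are assumptions for illustration; the real ClusterTopologyRefreshTask and the ClientResources shutdown path may look different.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

final class AwaitableRefreshSketch {

    // Hypothetical task shape; not the actual Lettuce class.
    static final class RefreshTask implements Runnable {
        private final Supplier<CompletionStage<?>> refresh;
        private final CompletableFuture<Void> done = new CompletableFuture<>();

        RefreshTask(Supplier<CompletionStage<?>> refresh) {
            this.refresh = refresh;
        }

        @Override
        public void run() {
            try {
                // Complete "done" once the asynchronous refresh chain has finished,
                // regardless of success or failure.
                refresh.get().whenComplete((result, error) -> done.complete(null));
            } catch (RuntimeException e) {
                done.complete(null); // the supplier itself failed; nothing left to wait for
            }
        }

        CompletableFuture<Void> done() {
            return done;
        }
    }

    static boolean demo() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CompletableFuture<String> topology = new CompletableFuture<>();
        RefreshTask task = new RefreshTask(() -> topology);
        executor.submit(task);

        // Shutdown path: release the executor only after the refresh has finished.
        CompletableFuture<Void> shutdown = task.done().thenRun(executor::shutdown);
        boolean waited = !shutdown.isDone();

        topology.complete("partitions"); // the simulated refresh finishes
        shutdown.join();
        return waited && executor.isShutdown();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Waiting on a completion future avoids the need to interrupt the refresh chain, which matters because the chain has no notion of interruption.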

It would also require a bit of housekeeping, e.g.

if (isEventLoopActive()) {
    clientResources.eventExecutorGroup().submit(clusterTopologyRefreshTask);
    return true;
}

isn't atomic, and EventExecutorGroup.submit(…) could return a failed future, which requires consideration as well.
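
One way to handle the check-then-submit race is to treat a rejected submit as "not scheduled". The sketch below assumes a JDK ExecutorService, which throws RejectedExecutionException after shutdown; `trySubmit` is a hypothetical helper. The case of a failed future returned by EventExecutorGroup.submit(…) would need an additional check on the returned future and is not shown.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

final class SubmitRaceSketch {

    // Returns true if the task was accepted, false if the executor is shut down.
    static boolean trySubmit(ExecutorService executor, Runnable task) {
        if (executor.isShutdown()) {
            return false; // fast path only; not sufficient on its own
        }
        try {
            executor.submit(task);
            return true;
        } catch (RejectedExecutionException e) {
            // Shutdown happened between the check and the submit.
            return false;
        }
    }

    static boolean demo() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        boolean acceptedBefore = trySubmit(executor, () -> { });
        executor.shutdown();
        boolean acceptedAfter = trySubmit(executor, () -> { });
        return acceptedBefore && !acceptedAfter;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```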

Kvicii · 2024-12-19T04:43:42Z

@tishun @mp911de @thachlp
Is there any follow-up? I have the same problem.
issue-3089

tishun · 2024-12-23T11:03:41Z

As @mp911de mentioned, we need to devise a better solution to this problem.
He has explained this in his comment here; I also elaborated more in #2904.

thachlp added 2 commits September 12, 2024 11:59

Shutdown ClusterTopologyRefreshTask when RedisClusterClient is shutdown

af62250

Update test

bb87e40

ggivo reviewed Sep 12, 2024

tishun changed the title ~~Shutdonw clustertopologyrefreshtask properly~~ Shutdown ClusterTopologyRefreshTask properly Sep 13, 2024

tishun requested changes Oct 14, 2024

tishun added the "status: waiting-for-feedback" label (We need additional information before we can continue) Oct 18, 2024

Karonazaba mentioned this pull request Dec 23, 2024

enablePeriodicRefresh has problematic behavior when service is shut down(close) #3089

Open

tishun mentioned this pull request Dec 23, 2024

BugReport: ClusterTopologyRefreshTask is not shutdown when RedisClusterClient is shutdown #2904

Open
